{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Character Text Splitter\n",
    "\n",
    "- Author: [hellohotkey](https://github.com/hellohotkey)\n",
    "- Peer Review : [fastjw](https://github.com/fastjw), [heewung song](https://github.com/kofsitho87)\n",
    "- Proofread : [JaeJun Shim](https://github.com/kkam-dragon)\n",
    "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/01-CharacterTextSplitter.ipynb)[![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/07-TextSplitter/01-CharacterTextSplitter.ipynb)\n",
    "## Overview\n",
    "\n",
    "Text splitting is a crucial step in document processing with LangChain. \n",
    "\n",
    "The ```CharacterTextSplitter``` offers efficient text chunking that provides several key benefits:\n",
    "\n",
    "- **Token Limits:** Overcomes LLM context window size restrictions\n",
    "- **Search Optimization:** Enables more precise chunk-level retrieval\n",
    "- **Memory Efficiency:** Processes large documents effectively\n",
    "- **Context Preservation:** Maintains textual coherence through ```chunk_overlap```\n",
    "\n",
    "This tutorial explores practical implementation of text splitting through core methods like ```split_text()``` and ```create_documents()```, including advanced features such as metadata handling.\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environement Setup](#environment-setup)\n",
    "- [CharacterTextSplitter Example](#charactertextsplitter-example)\n",
    "\n",
    "\n",
    "### References\n",
    "\n",
    "- [LangChain TextSplitter](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TextSplitter.html)\n",
    "- [LangChain CharacterTextSplitter](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html)\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n",
    "\n",
    "**[Note]**\n",
    "- ```langchain-opentutorial``` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n",
    "- You can checkout the [```langchain-opentutorial```](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langchain_text_splitters\",\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Environment variables have been set successfully.\n"
     ]
    }
   ],
   "source": [
    "# Set environment variables\n",
    "from langchain_opentutorial import set_env\n",
    "\n",
    "set_env(\n",
    "    {\n",
    "        \"OPENAI_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_TRACING_V2\": \"true\",\n",
    "        \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
    "        \"LANGCHAIN_PROJECT\": \"Adaptive-RAG\",  # title 과 동일하게 설정해 주세요\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can alternatively set API keys such as ```OPENAI_API_KEY``` in a ```.env``` file and load them.\n",
    "\n",
    "[Note] This is not necessary if you've already set the required API keys in previous steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load API keys from .env file\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv(override=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CharacterTextSplitter Example\n",
    "\n",
    "Read and store contents from keywords file\n",
    "* Open ```./data/appendix-keywords.txt``` file and read its contents.\n",
    "* Store the read contents in the ```file``` variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"./data/appendix-keywords.txt\", encoding=\"utf-8\") as f:\n",
    "   file = f.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Print the first 500 characters of the file contents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Semantic Search\n",
      "\n",
      "Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\n",
      "Example: Vectors of word embeddings can be stored in a database for quick access.\n",
      "Related keywords: embedding, database, vectorization, vectorization\n",
      "\n",
      "Embedding\n",
      "\n",
      "Definition: Embedding is the process of converting textual data, such as words or sentences, into a low-dimensional, continuous vector. This allows computers to unders\n"
     ]
    }
   ],
   "source": [
    "print(file[:500])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create ```CharacterTextSplitter``` with parameters:\n",
    "\n",
    "**Parameters**\n",
    "\n",
    "* ```separator```: String to split text on (e.g., newlines, spaces, custom delimiters)\n",
    "* ```chunk_size```: Maximum size of chunks to return\n",
    "* ```chunk_overlap```: Overlap in characters between chunks\n",
    "* ```length_function```: Function that measures the length of given chunks\n",
    "* ```is_separator_regex```: Boolean indicating whether separator should be treated as a regex pattern"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_text_splitters import CharacterTextSplitter\n",
    "\n",
    "text_splitter = CharacterTextSplitter(\n",
    "   separator=\" \",           # Splits whenever a space is encountered in text\n",
    "   chunk_size=250,          # Each chunk contains maximum 250 characters\n",
    "   chunk_overlap=50,        # Two consecutive chunks share 50 characters\n",
    "   length_function=len,     # Counts total characters in each chunk\n",
    "   is_separator_regex=False # Uses space as literal separator, not as regex\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create document objects from chunks and display the first one"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='Semantic Search\n",
      "\n",
      "Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\n",
      "Example: Vectors of word embeddings can be stored in a database for quick'\n"
     ]
    }
   ],
   "source": [
    "chunks = text_splitter.create_documents([file])\n",
    "print(chunks[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Demonstrate metadata handling during document creation:\n",
    "\n",
    "* ```create_documents``` accepts both text data and metadata lists\n",
    "* Each chunk inherits metadata from its source document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page_content='Semantic Search\n",
      "\n",
      "Definition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\n",
      "Example: Vectors of word embeddings can be stored in a database for quick' metadata={'document': 1}\n"
     ]
    }
   ],
   "source": [
    "# Define metadata for each document\n",
    "metadatas = [\n",
    "   {\"document\": 1},\n",
    "   {\"document\": 2},\n",
    "]\n",
    "\n",
    "# Create documents with metadata\n",
    "documents = text_splitter.create_documents(\n",
    "   [file, file],  # List of texts to split\n",
    "   metadatas=metadatas,  # Corresponding metadata\n",
    ")\n",
    "\n",
    "print(documents[0])  # Display first document with metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Split text using the ```split_text()``` method.\n",
    "* ```text_splitter.split_text(file)[0]``` returns the first chunk of the split text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Semantic Search\\n\\nDefinition: A vector store is a system that stores data converted to vector format. It is used for search, classification, and other data analysis tasks.\\nExample: Vectors of word embeddings can be stored in a database for quick'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Split the file text and return the first chunk\n",
    "text_splitter.split_text(file)[0]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "langchain-opentutorial-BBeXTkJT-py3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
